This data set contains simulated data that mimics customer behavior on the Starbucks rewards mobile app. Once every few days, Starbucks sends out an offer to users of the mobile app. An offer can be merely an advertisement for a drink or an actual offer such as a discount or BOGO (buy one get one free). Some users might not receive any offer during certain weeks.
Not all users receive the same offer, and that is the challenge to solve with this data set.
Your task is to combine transaction, demographic and offer data to determine which demographic groups respond best to which offer type. This data set is a simplified version of the real Starbucks app because the underlying simulator only has one product whereas Starbucks actually sells dozens of products.
Every offer has a validity period before the offer expires. As an example, a BOGO offer might be valid for only 5 days. You'll see in the data set that informational offers have a validity period even though these ads are merely providing information about a product; for example, if an informational offer has 7 days of validity, you can assume the customer is feeling the influence of the offer for 7 days after receiving the advertisement.
You'll be given transactional data showing user purchases made on the app including the timestamp of purchase and the amount of money spent on a purchase. This transactional data also has a record for each offer that a user receives as well as a record for when a user actually views the offer. There are also records for when a user completes an offer.
Keep in mind as well that someone using the app might make a purchase through the app without having received an offer or seen an offer.
To give an example, a user could receive a "spend 10 dollars, get 2 dollars off" discount offer on Monday, valid for 10 days from receipt. If the customer accumulates at least 10 dollars in purchases during the validity period, the customer completes the offer.
However, there are a few things to watch out for in this data set. Customers do not opt into the offers that they receive; in other words, a user can receive an offer, never actually view it, and still complete it. For example, a user might receive the "buy 10 dollars get 2 dollars off" offer but never open it during the 10-day validity period. If the customer spends 15 dollars during those ten days, an offer-completion record will appear in the data set; however, the customer was not influenced by the offer because the customer never viewed it.
This makes data cleaning especially important and tricky.
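The cleaning rule this implies can be sketched as follows; the toy `events` frame and the `influenced` helper below are hypothetical illustrations, not part of the provided data:

```python
import pandas as pd

# Toy transcript for one customer (times in hours). A completion only counts
# as "influenced" if a view happened between receipt and completion.
events = pd.DataFrame({
    "event": ["offer received", "transaction", "offer completed",
              "offer received", "offer viewed", "offer completed"],
    "offer_id": ["A", None, "A", "B", "B", "B"],
    "time": [0, 12, 48, 168, 174, 300],
})

def influenced(events, offer_id):
    """True if the offer was viewed after receipt and before completion."""
    sub = events[events["offer_id"] == offer_id].set_index("event")["time"]
    if "offer completed" not in sub.index or "offer viewed" not in sub.index:
        return False
    return sub["offer received"] <= sub["offer viewed"] <= sub["offer completed"]

print(influenced(events, "A"))  # completed but never viewed -> False
print(influenced(events, "B"))  # viewed before completing -> True
```

On the real transcript the same check would be applied per (customer, offer receipt), also bounding the window by the offer's duration.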
You'll also want to take into account that some demographic groups will make purchases even if they don't receive an offer. From a business perspective, if a customer is going to make a 10 dollar purchase without an offer anyway, you wouldn't want to send a buy 10 dollars get 2 dollars off offer. You'll want to try to assess what a certain demographic group will buy when not receiving any offers.
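A quick way to estimate that baseline is a groupby over no-offer transactions; the frame below is made-up illustration data, not the real profile or transcript:

```python
import pandas as pd

# Hypothetical transactions made with no offer in play (amounts invented).
tx = pd.DataFrame({
    "gender":   ["F", "F", "M", "M"],
    "age_band": ["30-40", "30-40", "30-40", "50-60"],
    "amount":   [12.0, 8.0, 4.0, 15.0],
})

# average no-offer spend per (gender, age band) cell
baseline = tx.groupby(["gender", "age_band"])["amount"].mean()
print(baseline)
```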
Because this is a capstone project, you are free to analyze the data any way you see fit. For example, you could build a machine learning model that predicts how much someone will spend based on demographics and offer type. Or you could build a model that predicts whether or not someone will respond to an offer. Or, you don't need to build a machine learning model at all. You could develop a set of heuristics that determine what offer you should send to each customer (i.e., 75 percent of women customers who were 35 years old responded to offer A vs 40 percent from the same demographic to offer B, so send offer A).
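The heuristic route can be sketched in a few lines: compute the completion rate per (demographic, offer) cell and pick the best offer for each group. The tiny `responses` frame below is invented purely for illustration:

```python
import pandas as pd

# Made-up responses: one row per offer sent, completed = 1 if it was completed.
responses = pd.DataFrame({
    "gender":    ["F", "F", "F", "M", "M", "M"],
    "offer":     ["A", "A", "B", "A", "B", "B"],
    "completed": [1, 1, 0, 0, 1, 1],
})

# completion rate per (gender, offer), then the best-performing offer per gender
rates = responses.groupby(["gender", "offer"])["completed"].mean()
best = rates.groupby(level="gender").idxmax()
print(best)
```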
The data is contained in three files:

- portfolio.json
- profile.json
- transcript.json

Here is the schema and explanation of each variable in the files:
Note: If you are using the workspace, you will need to go to the terminal and run the command conda update pandas before reading in the files. The version of pandas preinstalled in the workspace cannot read the transcript.json file correctly, but the newest version can. You can access the terminal from the orange icon in the top left of this notebook.
You can see how to access the terminal and how the install works using the two images below. First you need to access the terminal:

Then you will want to run the above command:

Finally, when you enter back into the notebook (use the jupyter icon again), you should be able to run the below cell without any errors.
import pandas as pd
import numpy as np
import math
import json
%matplotlib inline
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import fbeta_score, accuracy_score
# read in the json files
portfolio = pd.read_json('data/portfolio.json', orient='records', lines=True)
profile = pd.read_json('data/profile.json', orient='records', lines=True)
transcript = pd.read_json('data/transcript.json', orient='records', lines=True)
Let's look at the three datasets.
portfolio.head()
| | reward | channels | difficulty | duration | offer_type | id |
|---|---|---|---|---|---|---|
| 0 | 10 | [email, mobile, social] | 10 | 7 | bogo | ae264e3637204a6fb9bb56bc8210ddfd |
| 1 | 10 | [web, email, mobile, social] | 10 | 5 | bogo | 4d5c57ea9a6940dd891ad53e9dbe8da0 |
| 2 | 0 | [web, email, mobile] | 0 | 4 | informational | 3f207df678b143eea3cee63160fa8bed |
| 3 | 5 | [web, email, mobile] | 5 | 7 | bogo | 9b98b8c7a33c4b65b9aebfe6a799e6d9 |
| 4 | 5 | [web, email] | 20 | 10 | discount | 0b1e1539f2cc45b7b9fa7c272da2e1d7 |
portfolio.describe()
| | reward | difficulty | duration |
|---|---|---|---|
| count | 10.000000 | 10.000000 | 10.000000 |
| mean | 4.200000 | 7.700000 | 6.500000 |
| std | 3.583915 | 5.831905 | 2.321398 |
| min | 0.000000 | 0.000000 | 3.000000 |
| 25% | 2.000000 | 5.000000 | 5.000000 |
| 50% | 4.000000 | 8.500000 | 7.000000 |
| 75% | 5.000000 | 10.000000 | 7.000000 |
| max | 10.000000 | 20.000000 | 10.000000 |
portfolio.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ------
 0   reward      10 non-null     int64 
 1   channels    10 non-null     object
 2   difficulty  10 non-null     int64 
 3   duration    10 non-null     int64 
 4   offer_type  10 non-null     object
 5   id          10 non-null     object
dtypes: int64(3), object(3)
memory usage: 608.0+ bytes
portfolio.shape
(10, 6)
profile.head()
| | gender | age | id | became_member_on | income |
|---|---|---|---|---|---|
| 0 | None | 118 | 68be06ca386d4c31939f3a4f0e3dd783 | 20170212 | NaN |
| 1 | F | 55 | 0610b486422d4921ae7d2bf64640c50b | 20170715 | 112000.0 |
| 2 | None | 118 | 38fe809add3b4fcf9315a9694bb96ff5 | 20180712 | NaN |
| 3 | F | 75 | 78afa995795e4d85b5d9ceeca43f5fef | 20170509 | 100000.0 |
| 4 | None | 118 | a03223e636434f42ac4c3df47e8bac43 | 20170804 | NaN |
profile.describe()
| | age | became_member_on | income |
|---|---|---|---|
| count | 17000.000000 | 1.700000e+04 | 14825.000000 |
| mean | 62.531412 | 2.016703e+07 | 65404.991568 |
| std | 26.738580 | 1.167750e+04 | 21598.299410 |
| min | 18.000000 | 2.013073e+07 | 30000.000000 |
| 25% | 45.000000 | 2.016053e+07 | 49000.000000 |
| 50% | 58.000000 | 2.017080e+07 | 64000.000000 |
| 75% | 73.000000 | 2.017123e+07 | 80000.000000 |
| max | 118.000000 | 2.018073e+07 | 120000.000000 |
profile.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17000 entries, 0 to 16999
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -------
 0   gender            14825 non-null  object 
 1   age               17000 non-null  int64  
 2   id                17000 non-null  object 
 3   became_member_on  17000 non-null  int64  
 4   income            14825 non-null  float64
dtypes: float64(1), int64(2), object(2)
memory usage: 664.2+ KB
profile.shape
(17000, 5)
profile[profile['gender'].isnull()].count()
gender                 0
age                 2175
id                  2175
became_member_on    2175
income                 0
dtype: int64
profile[profile['income'].isnull()].count()
gender                 0
age                 2175
id                  2175
became_member_on    2175
income                 0
dtype: int64
profile.duplicated().sum()
0
profile['gender'].unique()
array([None, 'F', 'M', 'O'], dtype=object)
transcript.head()
| | person | event | value | time |
|---|---|---|---|---|
| 0 | 78afa995795e4d85b5d9ceeca43f5fef | offer received | {'offer id': '9b98b8c7a33c4b65b9aebfe6a799e6d9'} | 0 |
| 1 | a03223e636434f42ac4c3df47e8bac43 | offer received | {'offer id': '0b1e1539f2cc45b7b9fa7c272da2e1d7'} | 0 |
| 2 | e2127556f4f64592b11af22de27a7932 | offer received | {'offer id': '2906b810c7d4411798c6938adc9daaa5'} | 0 |
| 3 | 8ec6ce2a7e7949b1bf142def7d0e0586 | offer received | {'offer id': 'fafdcd668e3743c1bb461111dcafc2a4'} | 0 |
| 4 | 68617ca6246f4fbc85e91a2a49552598 | offer received | {'offer id': '4d5c57ea9a6940dd891ad53e9dbe8da0'} | 0 |
transcript.describe()
| | time |
|---|---|
| count | 306534.000000 |
| mean | 366.382940 |
| std | 200.326314 |
| min | 0.000000 |
| 25% | 186.000000 |
| 50% | 408.000000 |
| 75% | 528.000000 |
| max | 714.000000 |
transcript.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 306534 entries, 0 to 306533
Data columns (total 4 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ------
 0   person  306534 non-null  object
 1   event   306534 non-null  object
 2   value   306534 non-null  object
 3   time    306534 non-null  int64 
dtypes: int64(1), object(3)
memory usage: 9.4+ MB
transcript.shape
(306534, 4)
transcript.isnull().sum()
person    0
event     0
value     0
time      0
dtype: int64
transcript['event'].unique()
array(['offer received', 'offer viewed', 'transaction', 'offer completed'],
dtype=object)
Let's clean the datasets.
# check the outliers of age column
fig = px.box(profile, y="age")
fig.show()
# let's check the income of age 118
check_age = profile[profile['age']== 118]
check_age.head()
| | gender | age | id | became_member_on | income |
|---|---|---|---|---|---|
| 0 | None | 118 | 68be06ca386d4c31939f3a4f0e3dd783 | 20170212 | NaN |
| 2 | None | 118 | 38fe809add3b4fcf9315a9694bb96ff5 | 20180712 | NaN |
| 4 | None | 118 | a03223e636434f42ac4c3df47e8bac43 | 20170804 | NaN |
| 6 | None | 118 | 8ec6ce2a7e7949b1bf142def7d0e0586 | 20170925 | NaN |
| 7 | None | 118 | 68617ca6246f4fbc85e91a2a49552598 | 20171002 | NaN |
check_age.describe()
| | age | became_member_on | income |
|---|---|---|---|
| count | 2175.0 | 2.175000e+03 | 0.0 |
| mean | 118.0 | 2.016804e+07 | NaN |
| std | 0.0 | 1.009105e+04 | NaN |
| min | 118.0 | 2.013080e+07 | NaN |
| 25% | 118.0 | 2.016070e+07 | NaN |
| 50% | 118.0 | 2.017073e+07 | NaN |
| 75% | 118.0 | 2.017123e+07 | NaN |
| max | 118.0 | 2.018073e+07 | NaN |
# Age 118 is a placeholder value: the same 2175 rows have no gender or income, so we drop them.
profile = profile[profile['age']<118].reset_index(drop=True)
profile.describe()
| | age | became_member_on | income |
|---|---|---|---|
| count | 14825.000000 | 1.482500e+04 | 14825.000000 |
| mean | 54.393524 | 2.016689e+07 | 65404.991568 |
| std | 17.383705 | 1.188565e+04 | 21598.299410 |
| min | 18.000000 | 2.013073e+07 | 30000.000000 |
| 25% | 42.000000 | 2.016052e+07 | 49000.000000 |
| 50% | 55.000000 | 2.017080e+07 | 64000.000000 |
| 75% | 66.000000 | 2.017123e+07 | 80000.000000 |
| max | 101.000000 | 2.018073e+07 | 120000.000000 |
profile.head()
| | gender | age | id | became_member_on | income |
|---|---|---|---|---|---|
| 0 | F | 55 | 0610b486422d4921ae7d2bf64640c50b | 20170715 | 112000.0 |
| 1 | F | 75 | 78afa995795e4d85b5d9ceeca43f5fef | 20170509 | 100000.0 |
| 2 | M | 68 | e2127556f4f64592b11af22de27a7932 | 20180426 | 70000.0 |
| 3 | M | 65 | 389bc3fa690240e798340f5a15918d5c | 20180209 | 53000.0 |
| 4 | M | 58 | 2eeac8d8feae4a8cad5a6af0499a211d | 20171111 | 51000.0 |
# convert (became_member_on) to date
profile.became_member_on = pd.to_datetime(profile.became_member_on, format = '%Y%m%d')
profile.head()
| | gender | age | id | became_member_on | income |
|---|---|---|---|---|---|
| 0 | F | 55 | 0610b486422d4921ae7d2bf64640c50b | 2017-07-15 | 112000.0 |
| 1 | F | 75 | 78afa995795e4d85b5d9ceeca43f5fef | 2017-05-09 | 100000.0 |
| 2 | M | 68 | e2127556f4f64592b11af22de27a7932 | 2018-04-26 | 70000.0 |
| 3 | M | 65 | 389bc3fa690240e798340f5a15918d5c | 2018-02-09 | 53000.0 |
| 4 | M | 58 | 2eeac8d8feae4a8cad5a6af0499a211d | 2017-11-11 | 51000.0 |
# keep only the membership year
profile['became_member_on'] = profile['became_member_on'].dt.year
profile.head()
| | gender | age | id | became_member_on | income |
|---|---|---|---|---|---|
| 0 | F | 55 | 0610b486422d4921ae7d2bf64640c50b | 2017 | 112000.0 |
| 1 | F | 75 | 78afa995795e4d85b5d9ceeca43f5fef | 2017 | 100000.0 |
| 2 | M | 68 | e2127556f4f64592b11af22de27a7932 | 2018 | 70000.0 |
| 3 | M | 65 | 389bc3fa690240e798340f5a15918d5c | 2018 | 53000.0 |
| 4 | M | 58 | 2eeac8d8feae4a8cad5a6af0499a211d | 2017 | 51000.0 |
# rename columns
profile = profile.rename(columns={'id':'customer_id'})
profile.head()
| | gender | age | customer_id | became_member_on | income |
|---|---|---|---|---|---|
| 0 | F | 55 | 0610b486422d4921ae7d2bf64640c50b | 2017 | 112000.0 |
| 1 | F | 75 | 78afa995795e4d85b5d9ceeca43f5fef | 2017 | 100000.0 |
| 2 | M | 68 | e2127556f4f64592b11af22de27a7932 | 2018 | 70000.0 |
| 3 | M | 65 | 389bc3fa690240e798340f5a15918d5c | 2018 | 53000.0 |
| 4 | M | 58 | 2eeac8d8feae4a8cad5a6af0499a211d | 2017 | 51000.0 |
# convert duration from days to hours
portfolio['duration']=portfolio['duration'] * 24
portfolio.head()
| | reward | channels | difficulty | duration | offer_type | id |
|---|---|---|---|---|---|---|
| 0 | 10 | [email, mobile, social] | 10 | 168 | bogo | ae264e3637204a6fb9bb56bc8210ddfd |
| 1 | 10 | [web, email, mobile, social] | 10 | 120 | bogo | 4d5c57ea9a6940dd891ad53e9dbe8da0 |
| 2 | 0 | [web, email, mobile] | 0 | 96 | informational | 3f207df678b143eea3cee63160fa8bed |
| 3 | 5 | [web, email, mobile] | 5 | 168 | bogo | 9b98b8c7a33c4b65b9aebfe6a799e6d9 |
| 4 | 5 | [web, email] | 20 | 240 | discount | 0b1e1539f2cc45b7b9fa7c272da2e1d7 |
# rename id column to offer_id
portfolio=portfolio.rename(columns={'id':'offer_id'})
# one hot encoding for channel column
portfolio['email']=portfolio['channels'].apply(lambda x: 1 if 'email' in x else 0)
portfolio['mobile']=portfolio['channels'].apply(lambda x: 1 if 'mobile' in x else 0)
portfolio['social']=portfolio['channels'].apply(lambda x: 1 if 'social' in x else 0)
portfolio['web']=portfolio['channels'].apply(lambda x: 1 if 'web' in x else 0)
portfolio.drop(['channels'], axis=1, inplace=True)
portfolio.head()
| | reward | difficulty | duration | offer_type | offer_id | email | mobile | social | web |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 10 | 10 | 168 | bogo | ae264e3637204a6fb9bb56bc8210ddfd | 1 | 1 | 1 | 0 |
| 1 | 10 | 10 | 120 | bogo | 4d5c57ea9a6940dd891ad53e9dbe8da0 | 1 | 1 | 1 | 1 |
| 2 | 0 | 0 | 96 | informational | 3f207df678b143eea3cee63160fa8bed | 1 | 1 | 0 | 1 |
| 3 | 5 | 5 | 168 | bogo | 9b98b8c7a33c4b65b9aebfe6a799e6d9 | 1 | 1 | 0 | 1 |
| 4 | 5 | 20 | 240 | discount | 0b1e1539f2cc45b7b9fa7c272da2e1d7 | 1 | 0 | 0 | 1 |
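For reference, the same one-hot encoding can be written with explode + get_dummies instead of four apply calls; a sketch on a small made-up frame with the same structure as portfolio:

```python
import pandas as pd

# Hypothetical frame mirroring portfolio's channels column of lists.
demo = pd.DataFrame({
    "offer_id": ["o1", "o2"],
    "channels": [["email", "mobile"], ["web", "email"]],
})

# explode to one row per channel, dummy-encode, then collapse back per offer
dummies = pd.get_dummies(demo["channels"].explode()).groupby(level=0).max()
encoded = demo.drop(columns="channels").join(dummies)
print(encoded)
```

This scales to any number of channels without adding a line per channel name.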
transcript.head()
| | person | event | value | time |
|---|---|---|---|---|
| 0 | 78afa995795e4d85b5d9ceeca43f5fef | offer received | {'offer id': '9b98b8c7a33c4b65b9aebfe6a799e6d9'} | 0 |
| 1 | a03223e636434f42ac4c3df47e8bac43 | offer received | {'offer id': '0b1e1539f2cc45b7b9fa7c272da2e1d7'} | 0 |
| 2 | e2127556f4f64592b11af22de27a7932 | offer received | {'offer id': '2906b810c7d4411798c6938adc9daaa5'} | 0 |
| 3 | 8ec6ce2a7e7949b1bf142def7d0e0586 | offer received | {'offer id': 'fafdcd668e3743c1bb461111dcafc2a4'} | 0 |
| 4 | 68617ca6246f4fbc85e91a2a49552598 | offer received | {'offer id': '4d5c57ea9a6940dd891ad53e9dbe8da0'} | 0 |
# split the value column into separate offer_id and amount columns
transcript['offer_id'] = [list(v.values())[0] if list(v.keys())[0] in ('offer id', 'offer_id') else None for v in transcript.value]
transcript['amount'] = [np.round(list(v.values())[0], decimals=2) if list(v.keys())[0] == 'amount' else None for v in transcript.value]
# rename
transcript=transcript.rename(columns={'person':'customer_id'})
# drop value column
transcript=transcript.drop(columns={'value'})
transcript.head()
| | customer_id | event | time | offer_id | amount |
|---|---|---|---|---|---|
| 0 | 78afa995795e4d85b5d9ceeca43f5fef | offer received | 0 | 9b98b8c7a33c4b65b9aebfe6a799e6d9 | NaN |
| 1 | a03223e636434f42ac4c3df47e8bac43 | offer received | 0 | 0b1e1539f2cc45b7b9fa7c272da2e1d7 | NaN |
| 2 | e2127556f4f64592b11af22de27a7932 | offer received | 0 | 2906b810c7d4411798c6938adc9daaa5 | NaN |
| 3 | 8ec6ce2a7e7949b1bf142def7d0e0586 | offer received | 0 | fafdcd668e3743c1bb461111dcafc2a4 | NaN |
| 4 | 68617ca6246f4fbc85e91a2a49552598 | offer received | 0 | 4d5c57ea9a6940dd891ad53e9dbe8da0 | NaN |
portfolio.head()
| | reward | difficulty | duration | offer_type | offer_id | email | mobile | social | web |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 10 | 10 | 168 | bogo | ae264e3637204a6fb9bb56bc8210ddfd | 1 | 1 | 1 | 0 |
| 1 | 10 | 10 | 120 | bogo | 4d5c57ea9a6940dd891ad53e9dbe8da0 | 1 | 1 | 1 | 1 |
| 2 | 0 | 0 | 96 | informational | 3f207df678b143eea3cee63160fa8bed | 1 | 1 | 0 | 1 |
| 3 | 5 | 5 | 168 | bogo | 9b98b8c7a33c4b65b9aebfe6a799e6d9 | 1 | 1 | 0 | 1 |
| 4 | 5 | 20 | 240 | discount | 0b1e1539f2cc45b7b9fa7c272da2e1d7 | 1 | 0 | 0 | 1 |
profile.head()
| | gender | age | customer_id | became_member_on | income |
|---|---|---|---|---|---|
| 0 | F | 55 | 0610b486422d4921ae7d2bf64640c50b | 2017 | 112000.0 |
| 1 | F | 75 | 78afa995795e4d85b5d9ceeca43f5fef | 2017 | 100000.0 |
| 2 | M | 68 | e2127556f4f64592b11af22de27a7932 | 2018 | 70000.0 |
| 3 | M | 65 | 389bc3fa690240e798340f5a15918d5c | 2018 | 53000.0 |
| 4 | M | 58 | 2eeac8d8feae4a8cad5a6af0499a211d | 2017 | 51000.0 |
# Merge 1 (merge profile with transcript)
df = pd.merge(profile,transcript, on='customer_id')
df.head()
| | gender | age | customer_id | became_member_on | income | event | time | offer_id | amount |
|---|---|---|---|---|---|---|---|---|---|
| 0 | F | 55 | 0610b486422d4921ae7d2bf64640c50b | 2017 | 112000.0 | transaction | 18 | None | 21.51 |
| 1 | F | 55 | 0610b486422d4921ae7d2bf64640c50b | 2017 | 112000.0 | transaction | 144 | None | 32.28 |
| 2 | F | 55 | 0610b486422d4921ae7d2bf64640c50b | 2017 | 112000.0 | offer received | 408 | 9b98b8c7a33c4b65b9aebfe6a799e6d9 | NaN |
| 3 | F | 55 | 0610b486422d4921ae7d2bf64640c50b | 2017 | 112000.0 | offer received | 504 | 3f207df678b143eea3cee63160fa8bed | NaN |
| 4 | F | 55 | 0610b486422d4921ae7d2bf64640c50b | 2017 | 112000.0 | transaction | 528 | None | 23.22 |
# Renumber customer_id
customer_id_dict = df['customer_id'].unique()
customer_id_dict = pd.Series(customer_id_dict).to_dict()
customer_id_dict = dict([(v, k) for k, v in customer_id_dict.items()])
# Renumber offer_id
offer_id_dict = df['offer_id'].unique()
offer_id_dict = pd.Series(offer_id_dict).to_dict()
offer_id_dict = dict([(v, k) for k, v in offer_id_dict.items()])
offer_id_dict
{None: 0,
'9b98b8c7a33c4b65b9aebfe6a799e6d9': 1,
'3f207df678b143eea3cee63160fa8bed': 2,
'5a8bc65990b245e5a138643cd4eb9837': 3,
'ae264e3637204a6fb9bb56bc8210ddfd': 4,
'f19421c1d4aa40978ebb69ca19b0e20d': 5,
'2906b810c7d4411798c6938adc9daaa5': 6,
'fafdcd668e3743c1bb461111dcafc2a4': 7,
'2298d6c36e964ae4a3e7e9706d1fb8c2': 8,
'0b1e1539f2cc45b7b9fa7c272da2e1d7': 9,
'4d5c57ea9a6940dd891ad53e9dbe8da0': 10}
# Mapping the new ids to original df
df['customer_id'] = df['customer_id'].map(customer_id_dict)
df['offer_id'] = df['offer_id'].map(offer_id_dict)
# Mapping the new offer id to original portfolio dataset for future merge
portfolio['offer_id'] = portfolio['offer_id'].map(offer_id_dict)
df.head()
| | gender | age | customer_id | became_member_on | income | event | time | offer_id | amount |
|---|---|---|---|---|---|---|---|---|---|
| 0 | F | 55 | 0 | 2017 | 112000.0 | transaction | 18 | 0 | 21.51 |
| 1 | F | 55 | 0 | 2017 | 112000.0 | transaction | 144 | 0 | 32.28 |
| 2 | F | 55 | 0 | 2017 | 112000.0 | offer received | 408 | 1 | NaN |
| 3 | F | 55 | 0 | 2017 | 112000.0 | offer received | 504 | 2 | NaN |
| 4 | F | 55 | 0 | 2017 | 112000.0 | transaction | 528 | 0 | 23.22 |
portfolio.head()
| | reward | difficulty | duration | offer_type | offer_id | email | mobile | social | web |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 10 | 10 | 168 | bogo | 4 | 1 | 1 | 1 | 0 |
| 1 | 10 | 10 | 120 | bogo | 10 | 1 | 1 | 1 | 1 |
| 2 | 0 | 0 | 96 | informational | 2 | 1 | 1 | 0 | 1 |
| 3 | 5 | 5 | 168 | bogo | 1 | 1 | 1 | 0 | 1 |
| 4 | 5 | 20 | 240 | discount | 9 | 1 | 0 | 0 | 1 |
# df1 contains only the offer events (received / viewed / completed)
df1 = pd.merge(df, portfolio, on='offer_id')
df1 = df1.drop(columns={'amount'})
df1.head()
| | gender | age | customer_id | became_member_on | income | event | time | offer_id | reward | difficulty | duration | offer_type | email | mobile | social | web |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | F | 55 | 0 | 2017 | 112000.0 | offer received | 408 | 1 | 5 | 5 | 168 | bogo | 1 | 1 | 0 | 1 |
| 1 | F | 55 | 0 | 2017 | 112000.0 | offer completed | 528 | 1 | 5 | 5 | 168 | bogo | 1 | 1 | 0 | 1 |
| 2 | F | 75 | 1 | 2017 | 100000.0 | offer received | 0 | 1 | 5 | 5 | 168 | bogo | 1 | 1 | 0 | 1 |
| 3 | F | 75 | 1 | 2017 | 100000.0 | offer viewed | 6 | 1 | 5 | 5 | 168 | bogo | 1 | 1 | 0 | 1 |
| 4 | F | 75 | 1 | 2017 | 100000.0 | offer completed | 132 | 1 | 5 | 5 | 168 | bogo | 1 | 1 | 0 | 1 |
df1.shape
(148805, 16)
# check the remaining event types
df1['event'].unique()
array(['offer received', 'offer completed', 'offer viewed'], dtype=object)
# summary statistics for the offer events
df1.describe()
| | age | customer_id | became_member_on | income | time | offer_id | reward | difficulty | duration | email | mobile | social | web |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 148805.000000 | 148805.000000 | 148805.000000 | 148805.000000 | 148805.000000 | 148805.000000 | 148805.000000 | 148805.000000 | 148805.000000 | 148805.0 | 148805.000000 | 148805.000000 | 148805.000000 |
| mean | 54.779282 | 7408.517308 | 2016.576365 | 66414.119149 | 354.570223 | 5.706703 | 4.442445 | 7.890561 | 159.004980 | 1.0 | 0.917160 | 0.658311 | 0.806747 |
| std | 17.193533 | 4278.063634 | 1.186534 | 21496.947967 | 198.311301 | 2.796607 | 3.372362 | 5.041335 | 51.192852 | 0.0 | 0.275641 | 0.474277 | 0.394851 |
| min | 18.000000 | 0.000000 | 2013.000000 | 30000.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 72.000000 | 1.0 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 43.000000 | 3705.000000 | 2016.000000 | 51000.000000 | 168.000000 | 4.000000 | 2.000000 | 5.000000 | 120.000000 | 1.0 | 1.000000 | 0.000000 | 1.000000 |
| 50% | 56.000000 | 7400.000000 | 2017.000000 | 65000.000000 | 408.000000 | 6.000000 | 5.000000 | 10.000000 | 168.000000 | 1.0 | 1.000000 | 1.000000 | 1.000000 |
| 75% | 67.000000 | 11110.000000 | 2017.000000 | 81000.000000 | 510.000000 | 8.000000 | 5.000000 | 10.000000 | 168.000000 | 1.0 | 1.000000 | 1.000000 | 1.000000 |
| max | 101.000000 | 14824.000000 | 2018.000000 | 120000.000000 | 714.000000 | 10.000000 | 10.000000 | 20.000000 | 240.000000 | 1.0 | 1.000000 | 1.000000 | 1.000000 |
df2=df.drop(columns={'offer_id'})
# df3 contains only the purchase transactions (no offer events)
df3=df2.dropna(subset=['amount'])
df3.head()
| | gender | age | customer_id | became_member_on | income | event | time | amount |
|---|---|---|---|---|---|---|---|---|
| 0 | F | 55 | 0 | 2017 | 112000.0 | transaction | 18 | 21.51 |
| 1 | F | 55 | 0 | 2017 | 112000.0 | transaction | 144 | 32.28 |
| 4 | F | 55 | 0 | 2017 | 112000.0 | transaction | 528 | 23.22 |
| 8 | F | 75 | 1 | 2017 | 100000.0 | transaction | 132 | 19.89 |
| 10 | F | 75 | 1 | 2017 | 100000.0 | transaction | 144 | 17.78 |
# summary statistics for the purchase transactions
df3.describe()
| | age | customer_id | became_member_on | income | time | amount |
|---|---|---|---|---|---|---|
| count | 123957.000000 | 123957.000000 | 123957.000000 | 123957.000000 | 123957.000000 | 123957.000000 |
| mean | 52.713965 | 7416.438079 | 2016.314666 | 61843.510250 | 381.278201 | 13.996325 |
| std | 17.906667 | 4277.226631 | 1.241192 | 20661.890394 | 201.829085 | 31.749420 |
| min | 18.000000 | 0.000000 | 2013.000000 | 30000.000000 | 0.000000 | 0.050000 |
| 25% | 39.000000 | 3721.000000 | 2016.000000 | 46000.000000 | 210.000000 | 3.660000 |
| 50% | 54.000000 | 7424.000000 | 2017.000000 | 60000.000000 | 402.000000 | 10.800000 |
| 75% | 65.000000 | 11105.000000 | 2017.000000 | 74000.000000 | 552.000000 | 19.130000 |
| max | 101.000000 | 14824.000000 | 2018.000000 | 120000.000000 | 714.000000 | 1062.280000 |
# summary statistics for the full merged data
df.describe()
| | age | customer_id | became_member_on | income | time | offer_id | amount |
|---|---|---|---|---|---|---|---|
| count | 272762.000000 | 272762.000000 | 272762.000000 | 272762.000000 | 272762.000000 | 272762.000000 | 123957.000000 |
| mean | 53.840696 | 7412.116911 | 2016.457435 | 64337.000755 | 366.707694 | 3.113286 | 13.996325 |
| std | 17.551337 | 4277.677254 | 1.218663 | 21243.762941 | 200.359087 | 3.512949 | 31.749420 |
| min | 18.000000 | 0.000000 | 2013.000000 | 30000.000000 | 0.000000 | 0.000000 | 0.050000 |
| 25% | 41.000000 | 3712.000000 | 2016.000000 | 48000.000000 | 186.000000 | 0.000000 | 3.660000 |
| 50% | 55.000000 | 7409.000000 | 2017.000000 | 62000.000000 | 408.000000 | 1.000000 | 10.800000 |
| 75% | 66.000000 | 11109.000000 | 2017.000000 | 78000.000000 | 528.000000 | 6.000000 | 19.130000 |
| max | 101.000000 | 14824.000000 | 2018.000000 | 120000.000000 | 714.000000 | 10.000000 | 1062.280000 |
df1_model = df1.copy(deep=True)
# encode gender numerically (M = 1, F = -1, O = 0); .map avoids chained-assignment warnings
df1_model['gender'] = df1_model['gender'].map({'M': 1, 'F': -1, 'O': 0})
df1_model.head()
| | gender | age | customer_id | became_member_on | income | event | time | offer_id | reward | difficulty | duration | offer_type | email | mobile | social | web |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1 | 55 | 0 | 2017 | 112000.0 | offer received | 408 | 1 | 5 | 5 | 168 | bogo | 1 | 1 | 0 | 1 |
| 1 | -1 | 55 | 0 | 2017 | 112000.0 | offer completed | 528 | 1 | 5 | 5 | 168 | bogo | 1 | 1 | 0 | 1 |
| 2 | -1 | 75 | 1 | 2017 | 100000.0 | offer received | 0 | 1 | 5 | 5 | 168 | bogo | 1 | 1 | 0 | 1 |
| 3 | -1 | 75 | 1 | 2017 | 100000.0 | offer viewed | 6 | 1 | 5 | 5 | 168 | bogo | 1 | 1 | 0 | 1 |
| 4 | -1 | 75 | 1 | 2017 | 100000.0 | offer completed | 132 | 1 | 5 | 5 | 168 | bogo | 1 | 1 | 0 | 1 |
df1_model.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 148805 entries, 0 to 148804
Data columns (total 16 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -------
 0   gender            148805 non-null  object 
 1   age               148805 non-null  int64  
 2   customer_id       148805 non-null  int64  
 3   became_member_on  148805 non-null  int64  
 4   income            148805 non-null  float64
 5   event             148805 non-null  object 
 6   time              148805 non-null  int64  
 7   offer_id          148805 non-null  int64  
 8   reward            148805 non-null  int64  
 9   difficulty        148805 non-null  int64  
 10  duration          148805 non-null  int64  
 11  offer_type        148805 non-null  object 
 12  email             148805 non-null  int64  
 13  mobile            148805 non-null  int64  
 14  social            148805 non-null  int64  
 15  web               148805 non-null  int64  
dtypes: float64(1), int64(12), object(3)
memory usage: 19.3+ MB
offer_type_dict = df1_model['offer_type'].unique()
offer_type_dict
array(['bogo', 'informational', 'discount'], dtype=object)
# encode offer_type numerically (bogo = 1, informational = -1, discount = 0)
df1_model['offer_type'] = df1_model['offer_type'].map({'bogo': 1, 'informational': -1, 'discount': 0})
#encode 'event' data to numerical values
df1_model['event'] = df1_model['event'].map({'offer received':0, 'offer viewed':1, 'offer completed':2})
df1_model.head()
| | gender | age | customer_id | became_member_on | income | event | time | offer_id | reward | difficulty | duration | offer_type | email | mobile | social | web |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1 | 55 | 0 | 2017 | 112000.0 | 0 | 408 | 1 | 5 | 5 | 168 | 1 | 1 | 1 | 0 | 1 |
| 1 | -1 | 55 | 0 | 2017 | 112000.0 | 2 | 528 | 1 | 5 | 5 | 168 | 1 | 1 | 1 | 0 | 1 |
| 2 | -1 | 75 | 1 | 2017 | 100000.0 | 0 | 0 | 1 | 5 | 5 | 168 | 1 | 1 | 1 | 0 | 1 |
| 3 | -1 | 75 | 1 | 2017 | 100000.0 | 1 | 6 | 1 | 5 | 5 | 168 | 1 | 1 | 1 | 0 | 1 |
| 4 | -1 | 75 | 1 | 2017 | 100000.0 | 2 | 132 | 1 | 5 | 5 | 168 | 1 | 1 | 1 | 0 | 1 |
df1_model.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 148805 entries, 0 to 148804
Data columns (total 16 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -------
 0   gender            148805 non-null  object 
 1   age               148805 non-null  int64  
 2   customer_id       148805 non-null  int64  
 3   became_member_on  148805 non-null  int64  
 4   income            148805 non-null  float64
 5   event             148805 non-null  int64  
 6   time              148805 non-null  int64  
 7   offer_id          148805 non-null  int64  
 8   reward            148805 non-null  int64  
 9   difficulty        148805 non-null  int64  
 10  duration          148805 non-null  int64  
 11  offer_type        148805 non-null  object 
 12  email             148805 non-null  int64  
 13  mobile            148805 non-null  int64  
 14  social            148805 non-null  int64  
 15  web               148805 non-null  int64  
dtypes: float64(1), int64(13), object(2)
memory usage: 19.3+ MB
df1_model = df1_model.drop(columns = ["customer_id","email","mobile","social","web"])
df1_model.head()
| | gender | age | became_member_on | income | event | time | offer_id | reward | difficulty | duration | offer_type |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1 | 55 | 2017 | 112000.0 | 0 | 408 | 1 | 5 | 5 | 168 | 1 |
| 1 | -1 | 55 | 2017 | 112000.0 | 2 | 528 | 1 | 5 | 5 | 168 | 1 |
| 2 | -1 | 75 | 2017 | 100000.0 | 0 | 0 | 1 | 5 | 5 | 168 | 1 |
| 3 | -1 | 75 | 2017 | 100000.0 | 1 | 6 | 1 | 5 | 5 | 168 | 1 |
| 4 | -1 | 75 | 2017 | 100000.0 | 2 | 132 | 1 | 5 | 5 | 168 | 1 |
df1.head()
| | gender | age | customer_id | became_member_on | income | event | time | offer_id | reward | difficulty | duration | offer_type | email | mobile | social | web |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | F | 55 | 0 | 2017 | 112000.0 | offer received | 408 | 1 | 5 | 5 | 168 | bogo | 1 | 1 | 0 | 1 |
| 1 | F | 55 | 0 | 2017 | 112000.0 | offer completed | 528 | 1 | 5 | 5 | 168 | bogo | 1 | 1 | 0 | 1 |
| 2 | F | 75 | 1 | 2017 | 100000.0 | offer received | 0 | 1 | 5 | 5 | 168 | bogo | 1 | 1 | 0 | 1 |
| 3 | F | 75 | 1 | 2017 | 100000.0 | offer viewed | 6 | 1 | 5 | 5 | 168 | bogo | 1 | 1 | 0 | 1 |
| 4 | F | 75 | 1 | 2017 | 100000.0 | offer completed | 132 | 1 | 5 | 5 | 168 | bogo | 1 | 1 | 0 | 1 |
plt.ylabel("Frequency")
plt.xlabel("Age")
plt.hist(df1['age'], color = "m", bins=24);
plt.ylabel("Frequency")
plt.xlabel("Age")
plt.hist(df3['age'], color = "m", bins=24);
plt.ylabel("Frequency")
plt.xlabel("Membership year")
plt.hist(df1['became_member_on'], color = "c", bins=12);
plt.ylabel("Frequency")
plt.xlabel("Membership year")
plt.hist(df3['became_member_on'], color = "c", bins=12);
plt.ylabel("Frequency")
plt.xlabel("Income")
plt.hist(df1['income'], color = "g", bins=24);
plt.ylabel("Frequency")
plt.xlabel("Income")
plt.hist(df3['income'], color = "g", bins=24);
plt.ylabel("Frequency")
plt.xlabel("Offer event time")
plt.hist(df1['time'], color = "c", bins=24);
plt.ylabel("Frequency")
plt.xlabel("Offers completed over time")
plt.hist(df1[df1['event']=="offer completed"]['time'], color = "c", bins=24);
plt.ylabel("Frequency")
plt.xlabel("Offers received over time")
plt.hist(df1[df1['event']=="offer received"]['time'], color = "b", bins=24);
plt.ylabel("Frequency")
plt.xlabel("Offers viewed over time")
plt.hist(df1[df1['event']=="offer viewed"]['time'], color = "b", bins=24);
plt.ylabel("Frequency")
plt.xlabel("Time")
plt.hist(df3['time'], color = "b", bins=24);
px.histogram(df1, x="offer_type", color="gender",barmode='group')
px.histogram(df3, x="gender", color="gender",barmode='group')
px.histogram(df1, x="income", color="gender")
px.histogram(df1, x="time", color="event")
px.histogram(df1, x="offer_type", color="event",barmode='group', title="Offers vs Event Distributions")
px.histogram(df1, x="offer_id", color="event",barmode='group', title="Detailed Offers vs Event Distributions")
px.histogram(df1, x="time", color="offer_id")
px.histogram(df1, x="event", color="gender", barmode='group')
px.histogram(df1[df1["gender"] == "M"], x="offer_id", color="event", histfunc="count", barmode='group')
px.histogram(df1[df1["gender"] == "F"], x="offer_id", color="event", histfunc="count", barmode='group')
px.histogram(df1, x="age", color="offer_id")
px.histogram(df1, x="offer_type", color="became_member_on", barmode='group')
px.histogram(df1, x="event", color="became_member_on", barmode='group')
px.histogram(df1, x="income", color="offer_id")
When we examine the offers-sent and transaction data independently, we find that their demographics are almost identical: the age and membership-year distributions are essentially the same in both datasets.
Income, however, differs considerably between the two, especially above $75,000. Customers earning between $75k and $100k received proportionally more offers, so this group appears to have been Starbucks' target.
Offers were not delivered every day; after the second offer the sending interval shortened, and offer receipts declined after the initial burst of sends at day 0.
Men receive more offers than women in absolute numbers, while the distributions indicate that the women in this data earn more than the men.
1) The income distributions differ noticeably, especially above $75k: Starbucks targets customers earning between $75k and $100k more heavily.
2) Starbucks does not send offers every day; sending usually tapers off after the second offer, which suggests offers decrease when customers stop accepting them.
3) Transactions without offers increased along with the frequency of offers.
4) Men receive more offers.
5) Women complete offers at a higher rate than men.
6) Discount and BOGO offers are received in nearly equal numbers; BOGO offers are viewed more often, but discount offers are completed slightly more often.
7) Men are more likely than women to complete high-difficulty offers.
8) Men and women share the same top three difficulty levels.
9) Women's sensitivity to offer difficulty does not change with age.
10) Men under 70 prefer low-difficulty offers, while older men are more willing to take on high difficulties.
11) Starbucks applies the same strategy to all users; offer completions rise over members' first three years.
12) Completions declined among members who joined in 2016 and 2018; offers should be made clearer to long-standing customers so they keep completing new ones.
13) Men with incomes of $35k to $50k respond differently to different offers.
14) Women with incomes of $65k to $85k respond differently to different offers.
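Finding 5 can be checked by computing a completion rate per gender, i.e. completed events divided by received events. A minimal sketch on a toy stand-in for `df1`:

```python
import pandas as pd

# Toy stand-in for df1: one row per event, as in the merged frame above.
df1 = pd.DataFrame({"gender": ["F", "F", "F", "M", "M", "M", "M"],
                    "event": ["offer received", "offer viewed", "offer completed",
                              "offer received", "offer received",
                              "offer viewed", "offer completed"]})

# Completion rate = completed events / received events, per gender.
received = df1[df1["event"] == "offer received"].groupby("gender").size()
completed = df1[df1["event"] == "offer completed"].groupby("gender").size()
completion_rate = (completed / received).fillna(0)
print(completion_rate)
```

On the real `df1` this yields one rate per gender group, which is the basis for the comparison in finding 5.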
Male customers between the ages of 20 and 45 responded to offers 7 and 8 more favorably than to other offers.
Male customers between 45 and 60 responded to other offers instead of 7 and 8.
Male customers with incomes between $35k and $50k responded to offers other than 4, 6, and 9.

---

Women tend to value offers less but are also less sensitive to the type of offer; expanding the discounts available to women could help.
Female customers with incomes between $65k and $85k responded to offers 1 and 10.

---

Concentrate primarily on offers 7, 8, 5, and 1, which show a higher retention rate. In general, retention is higher among third- and fourth-year members.
sns.heatmap(df1_model.corr(method="kendall"),annot=True, fmt="0.2f");
df1_model.head()
| | gender | age | became_member_on | income | event | time | offer_id | reward | difficulty | duration | offer_type |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1 | 55 | 2017 | 112000.0 | 0 | 408 | 1 | 5 | 5 | 168 | 1 |
| 1 | -1 | 55 | 2017 | 112000.0 | 2 | 528 | 1 | 5 | 5 | 168 | 1 |
| 2 | -1 | 75 | 2017 | 100000.0 | 0 | 0 | 1 | 5 | 5 | 168 | 1 |
| 3 | -1 | 75 | 2017 | 100000.0 | 1 | 6 | 1 | 5 | 5 | 168 | 1 |
| 4 | -1 | 75 | 2017 | 100000.0 | 2 | 132 | 1 | 5 | 5 | 168 | 1 |
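The encoded frame above suggests the string columns were label-encoded: `event` appears as 0/1/2 and `gender` as -1 for F. A minimal sketch of such an encoding, assuming these mappings (the notebook's exact codes for M and O are not shown, so those are placeholders):

```python
import pandas as pd

df = pd.DataFrame({"gender": ["F", "M", "O"],
                   "event": ["offer received", "offer viewed", "offer completed"]})

# Assumed mappings inferred from the tables above; the notebook's actual
# codes for M and O may differ.
event_map = {"offer received": 0, "offer viewed": 1, "offer completed": 2}
gender_map = {"F": -1, "M": 1, "O": 0}

df["event"] = df["event"].map(event_map)
df["gender"] = df["gender"].map(gender_map)
```

Keeping the mapping in an explicit dict (rather than an anonymous encoder) also makes it easy to invert later when interpreting predictions.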
df1_model_Q2 = df1_model[df1_model['gender']==1].reset_index(drop=True)
sns.heatmap(df1_model_Q2.corr(method="kendall"), annot=True, fmt="0.2f");
df1_model_y = df1_model.copy(deep=True)
X = df1_model.drop(columns=["event"])
y = df1_model[["event"]]
y.head()
| | event |
|---|---|
| 0 | 0 |
| 1 | 2 |
| 2 | 0 |
| 3 | 1 |
| 4 | 2 |
scaler = MinMaxScaler()
def testandtrain(X, y, category=False, channel=False):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)
    cols = ["gender", "age", "became_member_on", "income", "time",
            "offer_id", "difficulty", "duration", "reward", "offer_type"]
    # Fit the scaler on the training split only, then reuse the fitted
    # min/max on the test split so no test-set information leaks into the
    # scaling (MinMaxScaler scales each column independently).
    X_train[cols] = scaler.fit_transform(X_train[cols])
    X_test[cols] = scaler.transform(X_test[cols])
    return X_train, X_test, y_train, y_test
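As an aside, the column scaling can also be expressed as a scikit-learn `Pipeline`, which fits the scaler on the training split only and reuses it automatically at prediction time. A minimal sketch on toy data, not the notebook's frames:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Toy features and a toy 3-class target standing in for X and y.
X = np.random.RandomState(0).rand(100, 4) * 100
y = np.random.RandomState(1).randint(0, 3, 100)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=42)

# The pipeline fits MinMaxScaler on X_tr only; X_te is scaled with the
# training-set min/max at predict time, so no leakage can occur.
pipe = make_pipeline(MinMaxScaler(), DecisionTreeClassifier(random_state=1))
pipe.fit(X_tr, y_tr)
preds = pipe.predict(X_te)
```

This keeps the scaling and the model as one object, which also makes cross-validation leak-free by construction.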
X_train, X_test, y_train, y_test = testandtrain(X,y)
X_train.shape
(74402, 10)
y_train.shape
(74402, 1)
X_test.shape
(74403, 10)
y_test.shape
(74403, 1)
X_train.describe()
| | gender | age | became_member_on | income | time | offer_id | reward | difficulty | duration | offer_type |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 74402.000000 | 74402.000000 | 74402.000000 | 74402.000000 | 74402.000000 | 74402.000000 | 74402.000000 | 74402.000000 | 74402.000000 | 74402.000000 |
| mean | 0.564387 | 0.442281 | 0.715704 | 0.404558 | 0.497223 | 0.522162 | 0.444792 | 0.393850 | 0.517254 | 0.639553 |
| std | 0.491918 | 0.206864 | 0.237453 | 0.238705 | 0.277118 | 0.311485 | 0.337052 | 0.251632 | 0.304082 | 0.355390 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 0.301205 | 0.600000 | 0.233333 | 0.243697 | 0.222222 | 0.200000 | 0.250000 | 0.285714 | 0.500000 |
| 50% | 1.000000 | 0.445783 | 0.800000 | 0.388889 | 0.571429 | 0.555556 | 0.500000 | 0.500000 | 0.571429 | 0.500000 |
| 75% | 1.000000 | 0.578313 | 0.800000 | 0.566667 | 0.714286 | 0.777778 | 0.500000 | 0.500000 | 0.571429 | 1.000000 |
| max | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
sns.heatmap(X_test.corr(method="kendall"), annot=True, fmt="0.2f");
def train_test(model):
    # Fit once, then score both splits. Note the train metric is plain
    # accuracy while the test metric is micro-averaged F-beta (beta=0.5).
    model.fit(X_train, y_train)
    train_predictions = model.predict(X_train)
    test_predictions = model.predict(X_test)
    train_f1_score = accuracy_score(y_train, train_predictions) * 100
    test_f1_score = fbeta_score(y_test, test_predictions, beta=0.5, average='micro') * 100
    return train_f1_score, test_f1_score, test_predictions
dtc = DecisionTreeClassifier(random_state = 1)
dtc_train_f1_score, dtc_test_f1_score,dtc_test_predictions = train_test(dtc)
models = {'Model': ['DecisionTreeClassifier'],
          'Train F1 score': [dtc_train_f1_score],
          'Test F1 score': [dtc_test_f1_score]}
results = pd.DataFrame(models)
results
| | Model | Train F1 score | Test F1 score |
|---|---|---|---|
| 0 | DecisionTreeClassifier | 94.631865 | 63.713829 |
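Because the train and test numbers above use different metrics (accuracy vs micro F-beta), a confusion matrix over the three event classes gives a clearer per-class picture of where the classifier goes wrong. A minimal sketch with made-up labels, not the notebook's actual predictions:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Made-up true/predicted event codes (0=received, 1=viewed, 2=completed).
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0, 2]

cm = confusion_matrix(y_true, y_pred)
print(cm)
print(classification_report(y_true, y_pred, digits=2))
```

Rows are true classes, columns predicted classes; off-diagonal counts show which event types are confused with each other, which a single aggregate score hides.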
X_test.head()
| | gender | age | became_member_on | income | time | offer_id | reward | difficulty | duration | offer_type |
|---|---|---|---|---|---|---|---|---|---|---|
| 93490 | 1.0 | 0.216867 | 1.0 | 0.200000 | 0.294118 | 0.666667 | 0.2 | 0.50 | 1.000000 | 0.5 |
| 79279 | 1.0 | 0.277108 | 0.0 | 0.544444 | 0.705882 | 0.555556 | 0.2 | 0.50 | 0.571429 | 0.5 |
| 5980 | 1.0 | 0.518072 | 0.2 | 0.233333 | 0.638655 | 0.000000 | 0.5 | 0.25 | 0.571429 | 1.0 |
| 124850 | 0.0 | 0.277108 | 0.6 | 0.233333 | 0.705882 | 0.888889 | 0.5 | 1.00 | 1.000000 | 0.5 |
| 46034 | 1.0 | 0.590361 | 0.6 | 0.655556 | 0.470588 | 0.333333 | 1.0 | 0.50 | 0.571429 | 1.0 |
y_test.head()
| | event |
|---|---|
| 93490 | 2 |
| 79279 | 0 |
| 5980 | 1 |
| 124850 | 0 |
| 46034 | 0 |
dtc_test_predictions
array([2, 0, 1, ..., 2, 0, 0], dtype=int64)
pred_out=pd.DataFrame(dtc_test_predictions, columns=['predicted_outcome'])
pred_out.head()
| | predicted_outcome |
|---|---|
| 0 | 2 |
| 1 | 0 |
| 2 | 1 |
| 3 | 0 |
| 4 | 1 |
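Assuming the 0/1/2 event encoding noted earlier (received/viewed/completed), the predicted codes can be mapped back to readable labels; `code_to_event` is a hypothetical helper that inverts that assumed mapping:

```python
import pandas as pd

# Assumed inverse of the event encoding; adjust if the notebook's codes differ.
code_to_event = {0: "offer received", 1: "offer viewed", 2: "offer completed"}

pred_out = pd.DataFrame({"predicted_outcome": [2, 0, 1, 0, 1]})
pred_out["predicted_event"] = pred_out["predicted_outcome"].map(code_to_event)
```

This makes the prediction table directly comparable to the original `event` strings in `df1`.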